Abstract

We consider a controlled Markov chain whose transition probabilities and
initial distribution are parametrized by an unknown parameter $\theta$ belonging
to some known parameter space $\Theta$. There is a one-step reward associated
with each pair of control and the following state of the process. The
objective is to maximize the expected value of the sum of one-step rewards
over an infinite horizon. By introducing the Loss associated with a control
scheme, we show that our problem is equivalent to minimizing this Loss.
We define uniformly good adaptive control schemes and restrict attention to
these schemes. We develop a lower bound on the Loss associated with any
uniformly good control scheme. Finally, we construct an adaptive control
scheme whose Loss equals the lower bound, and is therefore optimal.


1.  Introduction 


Consider the following stochastic adaptive control problem: The system is
modelled by a controlled Markov chain with an unknown parameter, i.e.

$$ \mathcal{P}_\theta\{X_{n+1} = y \mid X_n = x, X_{n-1}, \ldots, X_0, U_n, \ldots, U_0\} = P(x, y; U_n, \theta) \tag{1.1} $$

where $X_0, U_0, X_1, U_1, \ldots, X_n, U_n, X_{n+1}, \ldots$ is the chronological
sequence of states and control actions, and $\theta$ is an unknown parameter
belonging to some known parameter space $\Theta$; and

$$ \mathcal{P}_\theta\{X_0 = x\} = p(x; \theta) \tag{1.2} $$

where $\theta$ is the same as in (1.1). There is a one-step reward $r(X_n, U_n)$
associated with each pair $(X_n, U_n)$, $n \ge 0$. The objective is to find an
adaptive control scheme which maximizes, in some sense, the expected value of
the sum of one-step rewards

$$ E_\theta J_n = E_\theta \sum_{t=0}^{n-1} r(X_t, U_t), \quad \text{as } n \to \infty. \tag{1.3} $$


One of the current approaches to stochastic adaptive control problems is the
so-called "Certainty Equivalent Control with Forcing" (cf. [1]). This scheme is
self-tuning in the Cesaro sense and is therefore also optimal for an average
reward per unit time criterion (cf. [1]). The reward criterion described by
(1.3) suggests that we need to determine the maximum rate of increase of
$E_\theta J_n$ as $n \to \infty$. This requirement introduces a notion of
optimality that is stronger than the one suggested by the average reward per
unit time criterion used in [1] - [7]. For the criterion (1.3) it is no longer
clear that the Certainty Equivalent Control with Forcing is optimal.

The  same  reward  criterion  as  (1.3)  was  previously  used  in  [8]  for  the  study 
of  the  controlled  i.i.d.  process  problem.  This  criterion  was  initially  used  by  Lai 
and  Robbins  [9],  [10]  for  the  multi-armed  bandit  problem.  Various  extensions  of 
the  Lai  and  Robbins  formulation  of  the  multi-armed  bandit  problems  have  been 
reported  in  [11]  and  [12].  In  this  paper  we  show  that  the  adaptive  control  problem 
of Markov chains can be viewed as a bandit problem with Markovian rewards. Such
a  relation  provides  a  convenient  way  of  analyzing  the  problem,  and  allows  us  to 
develop  an  “efficient”  adaptive  control  scheme.  (We  shall  precisely  define  what  we 
mean  by  efficient  in  Section  3.) 

2.  The  Problem 

2.1  The  System  Model 

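Consider a stochastic system described by a controlled Markov chain on the
state space $\mathcal{X}$, with control set $\mathcal{U}$, transition probability
matrix

$$ P(u, \theta) := \{P(x, y; u, \theta) \mid x, y \in \mathcal{X}\} \tag{2.1} $$

and initial probability mass function

$$ p(\theta) := \{p(x; \theta) \mid x \in \mathcal{X}\}. \tag{2.2} $$

The parameter $\theta$ is unknown, but belongs to a known set $\Theta$. Assume
that $\mathcal{X}$, $\mathcal{U}$ and $\Theta$ are all finite. Further assume that
for $x, y \in \mathcal{X}$; $u \in \mathcal{U}$; $\theta, \theta' \in \Theta$,

$$ P(x, y; u, \theta) > 0 \implies P(x, y; u, \theta') > 0; \tag{2.3} $$

for every stationary control law $g : \mathcal{X} \to \mathcal{U}$,

$$ P^g(\theta) := \{P(x, y; g(x), \theta) \mid x, y \in \mathcal{X}\} \tag{2.4} $$

is irreducible and aperiodic for all $\theta \in \Theta$, and

$$ p(x; \theta) > 0 \text{ for all } x \in \mathcal{X} \text{ and } \theta \in \Theta. \tag{2.5} $$

Let

$$ \pi^g(\theta) := \{\pi^g(x; \theta) \mid x \in \mathcal{X}\} \tag{2.6} $$

be the stationary distribution corresponding to $P^g(\theta)$ and let

$$ \mu^g(\theta) := \sum_{x \in \mathcal{X}} \pi^g(x; \theta)\, r(x, g(x)) \tag{2.7} $$

be the mean reward under that stationary distribution.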
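The quantities in (2.7) can be computed numerically for a small example. The following is a minimal sketch, assuming a hypothetical two-state, two-action chain (the matrices `P_u`, the reward table `r`, and the law `g` are illustrative and not taken from the paper); it uses plain power iteration, which converges here because the closed-loop matrix is irreducible and aperiodic.

```python
def closed_loop_matrix(P_u, g):
    """Row x of P^g is row x of the transition matrix of action g[x]."""
    return [P_u[g[x]][x] for x in range(len(g))]


def stationary_distribution(P, iters=1000):
    """Power iteration pi <- pi P for a row-stochastic matrix P."""
    k = len(P)
    pi = [1.0 / k] * k
    for _ in range(iters):
        pi = [sum(pi[x] * P[x][y] for x in range(k)) for y in range(k)]
    return pi


def mean_reward(P_u, r, g):
    """mu^g = sum_x pi^g(x) r(x, g(x)), as in (2.7)."""
    pi = stationary_distribution(closed_loop_matrix(P_u, g))
    return sum(pi[x] * r[x][g[x]] for x in range(len(g)))


# Hypothetical example: P_u[u][x][y] and r[x][u].
P_u = {0: [[0.9, 0.1], [0.5, 0.5]],
       1: [[0.2, 0.8], [0.3, 0.7]]}
r = [[1.0, 0.0], [0.0, 2.0]]
g = (0, 1)  # stationary law: g(0) = 0, g(1) = 1

mu = mean_reward(P_u, r, g)
```

For this law the closed-loop matrix has rows (0.9, 0.1) and (0.3, 0.7), so the stationary distribution is (0.75, 0.25) and the mean reward is 0.75 * 1.0 + 0.25 * 2.0 = 1.25.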

An "adaptive control scheme" $\gamma$ is a sequence of random variables
$\{U_n\}_{n=0}^{\infty}$ taking values in the set $\mathcal{U}$ such that the event
$\{U_n = u\}$ belongs to the $\sigma$-field $\mathcal{F}_n$ generated by
$X_0, U_0, X_1, U_1, \ldots, U_{n-1}, X_n$. Let $r(X_t, U_t)$ represent the one-step
reward at time $t$, where $r : \mathcal{X} \times \mathcal{U} \to \mathbb{R}$.
Further define $J_n := \sum_{t=0}^{n-1} r(X_t, U_t)$, the total reward at time $n$,
as the sum of the one-step rewards up to time $n$.

Our objective is to find an adaptive control scheme $\gamma$ which maximizes, in
some sense, $E_\theta J_n$ as $n \to \infty$. We shall clarify this notion of
optimality in Section 2.4. To achieve our objective we would like to express
approximately $E_\theta J_n$ in terms of the expected number of times each of the
stationary control laws $g$ is used up to time $n$, and the expected one-step
reward under the invariant distribution corresponding to each $g$. For this
purpose we need to translate any adaptive control scheme $\gamma$ to an
equivalent adaptive control scheme $\gamma'$ with the following features:


(F1) The control scheme $\gamma'$ chooses a stationary control law $g_n$ (instead
of a control action $U_n$) at each time $n$.

(F2) Whenever a fixed but arbitrary stationary control law $g$, chosen by
$\gamma'$, is used, the sequence of states observed is Markovian. Moreover the
sequences of states corresponding to the different stationary control laws,
chosen by $\gamma'$, are independent conditioned on the initial state.

In Section 2.2 we identify a set of conditions which, if satisfied, lead to a
control scheme $\gamma'$ that has the above features, and we construct such an
equivalent control scheme. In Section 2.3 we define the probability space
$(\Omega', \mathcal{F}', \mathcal{P}'_\theta)$ which allows us to define a
sequence of states which for each stationary law $g$ is Markovian, and
independent of the sequence of states of any other stationary law $g'$,
conditioned on all their initial states. Using
$(\Omega', \mathcal{F}', \mathcal{P}'_\theta)$ and $\gamma'$ we can define a
control problem that is equivalent to the original one, and we can express
$E_\theta J_n$ in terms of the expected number of times each of the stationary
control laws $g$ is used up to $n$, and the expected one-step reward under the
invariant distribution corresponding to each $g$. Such an expression for
$E_\theta J_n$ allows us to precisely define the sense in which we want to
maximize it.


2.2  The  Translation  Scheme 

Lemma 2.1 Given a controlled Markov chain on a finite state space $\mathcal{X}$
and with a finite control set $\mathcal{U}$, for any adaptive control scheme
$\gamma$ (as defined earlier) there exists an "equivalent adaptive control
scheme" $\gamma'$ taking values on the set
$\mathcal{G} := \{g : \mathcal{X} \to \mathcal{U}\}$ of stationary control laws
with the following properties.

(i) $\gamma'$ is a sequence of random variables $\{g_n\}_{n=0}^{\infty}$ taking
values on the set $\mathcal{G}$ such that the event $\{g_n = g\}$ belongs to the
$\sigma$-field $\mathcal{F}'_n$ generated by
$X_0, g_0, X_1, g_1, \ldots, g_{n-1}, X_n$.

(ii) $U_n(\omega) = g_n(X_n)(\omega)$ $\forall n, \omega$.

(iii) If $n_k$ and $n_{k+1}$ are any two successive time instants at which a
stationary control law $g$ (fixed, but arbitrary) is used, i.e.
$g_{n_k} = g_{n_{k+1}} = g$ and $g_n \ne g$ for $n_k < n < n_{k+1}$, then
$X_{n_{k+1}} = X_{n_k + 1}$.

(Notice that (i) implies $\mathcal{F}'_n = \mathcal{F}_n$.)

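Proof (by construction)

Let $\#\mathcal{X} = k$ and let $x^{1'}, x^{2'}, \ldots, x^{k'}$ be a prior (but
arbitrary) ordering of $\mathcal{X}$. Similarly let $\#\mathcal{U} = l$ and
$\mathcal{U} = \{u^1, u^2, \ldots, u^l\}$. To start off observe $X_0$ and then
reorder $\mathcal{X}$ as $x^1, x^2, \ldots, x^k$ by a left cyclic shift of the
prior ordering, such that $x^1 = X_0$. Define $\mathcal{G}_0^i$,
$i = 1, \ldots, k$ inductively as follows:

$$ \mathcal{G}_0^1 = \{g \in \mathcal{G} : g(x^j) = u^1,\ 1 < j \le k\} $$

$$ \mathcal{G}_0^i = \{g \in \mathcal{G} : g(x^j) = u^1,\ i < j \le k\} - \bigcup_{j=1}^{i-1} \mathcal{G}_0^j, \quad i = 2, \ldots, k. $$

Notice that $\mathcal{G}_0^i$, $i = 1, \ldots, k$ defines a partition of
$\mathcal{G}$, i.e. $\bigcup_{i=1}^{k} \mathcal{G}_0^i = \mathcal{G}$ and
$i \ne j \implies \mathcal{G}_0^i \cap \mathcal{G}_0^j = \emptyset$.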
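The initial partition can be enumerated directly for a small case. The sketch below is one reading of the definition of $\mathcal{G}_0^i$ (a law belongs to cell $i$ when it equals $u^1$ on every $x^j$ with $j > i$ and does not already belong to an earlier cell); the function name and the tuple encoding of control laws are illustrative assumptions.

```python
from itertools import product


def initial_partition(k, actions):
    """Return [G_0^1, ..., G_0^k] for k ordered states and the given actions.

    A control law is encoded as a tuple (g(x^1), ..., g(x^k)).
    """
    u1 = actions[0]
    all_laws = list(product(actions, repeat=k))
    cells, used = [], set()
    for i in range(1, k + 1):
        # Laws equal to u^1 on x^j for every j > i (0-based indices i..k-1).
        candidates = {g for g in all_laws
                      if all(g[j] == u1 for j in range(i, k))}
        cell = candidates - used  # remove the earlier cells
        cells.append(cell)
        used |= cell
    return cells


cells = initial_partition(3, ("u1", "u2"))
sizes = [len(c) for c in cells]
```

With $k = 3$ states and $l = 2$ actions the cells have sizes 2, 2 and 4, that is $l^i - l^{i-1}$ for $i \ge 2$, so the union of the first $m$ cells contains exactly one law for each of the $l^m$ assignments on $\{x^1, \ldots, x^m\}$.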

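Now suppose at time $n \ge 0$, i.e. after observing $X_n$, we have a partition
$\mathcal{G}_n^i$, $i = 1, \ldots, k$ of $\mathcal{G}$ with the following five
properties:

P1) $\mathcal{G}_n^i$, $i = 1, \ldots, k$ is determined by $\mathcal{F}'_n$.

P2) $\forall\, 1 \le i \le k$ and $g \in \mathcal{G}_n^i$, the last time up to time
$n - 1$ that the control $g$ was used (if any) was followed by the state $x^i$.

Let

$$ X_n = x^{j_n} \text{ for some } j_n = 1, \ldots, k. \tag{2.8} $$

Then,

P3) $\forall\, j_n \le m \le k$ and for any $f_m : \{x^1, \ldots, x^m\} \to \mathcal{U}$
there exists a unique $g \in \bigcup_{i=1}^{m} \mathcal{G}_n^i$ such that
$g|_{\{x^1, \ldots, x^m\}} = f_m$.

P4) $\forall\, 1 \le m < j_n$ there exists a unique
$f'_m : \{x^1, \ldots, x^m\} \to \mathcal{U}$ such that
$\forall g \in \bigcup_{i=1}^{m} \mathcal{G}_n^i$,
$g|_{\{x^1, \ldots, x^m\}} \ne f'_m$.

P5) $\forall\, 1 < m < j_n$ the above found $f'_m$'s satisfy
$f'_{m-1} = f'_m|_{\{x^1, \ldots, x^{m-1}\}}$.

Also assume that

P6) $g_j$, $0 \le j < n$ satisfy properties (i), (ii) and (iii) of Lemma 2.1.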


We shall now show that we can choose a $g_n$ satisfying property (P6) on the
basis of $\mathcal{F}'_n$ and construct a new partition $\mathcal{G}_{n+1}^i$,
$i = 1, \ldots, k$ satisfying properties (P1) - (P5) assumed true for time $n$.
Choose $g_n \in \mathcal{G}_n^{j_n}$ ($j_n$ as determined by (2.8)) such that

$$ g_n|_{\{x^1, \ldots, x^{j_n - 1}\}} = f'_{j_n - 1} \quad \text{and} \quad g_n(x^{j_n}) = g_n(X_n) = U_n. \tag{2.9} $$

Such a choice is clearly possible by the above induction hypothesis (properties
(P3) & (P4)). By noting the fact that $U_n$ is determined by
$\mathcal{F}_n = \mathcal{F}'_n$ and by the induction hypothesis (properties (P1),
(P2) and (P6)) it follows that (P6) is satisfied for $n + 1$. Next, let
$X_{n+1} = x^{j_{n+1}}$ for some $j_{n+1} = 1, \ldots, k$. If $j_{n+1} = j_n$ then
$\mathcal{G}_{n+1}^i := \mathcal{G}_n^i$ $\forall i = 1, \ldots, k$, and it trivially
follows that $\mathcal{G}_{n+1}^i$, $i = 1, \ldots, k$ satisfy (P1)-(P5). Else, if
$j_{n+1} \ne j_n$,
$\mathcal{G}_{n+1}^{j_n} := \mathcal{G}_n^{j_n} - \{g_n\}$,
$\mathcal{G}_{n+1}^{j_{n+1}} := \mathcal{G}_n^{j_{n+1}} + \{g_n\}$,
and $\forall i \ne j_n, j_{n+1}$, $\mathcal{G}_{n+1}^i := \mathcal{G}_n^i$. In this
case also it is easy to check that $\mathcal{G}_{n+1}^i$ satisfy (P1) & (P2). To
show that $\mathcal{G}_{n+1}^i$ satisfy (P3)-(P5) consider two cases.

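Case 1. $j_{n+1} > j_n$

- $\forall\, j_{n+1} \le m \le k$:
$\bigcup_{i=1}^{m} \mathcal{G}_{n+1}^i = \bigcup_{i=1}^{m} \mathcal{G}_n^i - \{g_n\} + \{g_n\} = \bigcup_{i=1}^{m} \mathcal{G}_n^i$.
Thus (P3) is satisfied.

- $\forall\, 1 \le m < j_n$:
$\bigcup_{i=1}^{m} \mathcal{G}_{n+1}^i = \bigcup_{i=1}^{m} \mathcal{G}_n^i$.
Thus (P4) & (P5) are satisfied for $1 \le m < j_n$ and $1 < m < j_n$
respectively.

- $\forall\, j_n \le m < j_{n+1}$:
$\bigcup_{i=1}^{m} \mathcal{G}_{n+1}^i = \bigcup_{i=1}^{m} \mathcal{G}_n^i - \{g_n\}$.
Consider $f'_m = g_n|_{\{x^1, \ldots, x^m\}}$. By the induction hypothesis (P3) it
then follows that (P4) is satisfied for $j_n \le m < j_{n+1}$.

Clearly this construction of $f'_m$ also satisfies

$$ f'_{m-1} = f'_m|_{\{x^1, \ldots, x^{m-1}\}} \quad \forall\, j_n < m < j_{n+1} $$

and by (2.9) it also follows that

$$ f'_{j_n - 1} = f'_{j_n}\text{(new)}\,|_{\{x^1, \ldots, x^{j_n - 1}\}}. $$

Thus (P5) is satisfied for $j_n \le m < j_{n+1}$.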


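Case 2. $j_{n+1} < j_n$

- $\forall\, j_n \le m \le k$:
$\bigcup_{i=1}^{m} \mathcal{G}_{n+1}^i = \bigcup_{i=1}^{m} \mathcal{G}_n^i - \{g_n\} + \{g_n\} = \bigcup_{i=1}^{m} \mathcal{G}_n^i$.
Thus (P3) is satisfied for $j_n \le m \le k$.

- $\forall\, j_{n+1} \le m < j_n$:
$\bigcup_{i=1}^{m} \mathcal{G}_{n+1}^i = \bigcup_{i=1}^{m} \mathcal{G}_n^i + \{g_n\}$.
And since $f'_m = g_n|_{\{x^1, \ldots, x^m\}}$ was the unique one missing from
$\bigcup_{i=1}^{m} \mathcal{G}_n^i$ (by (2.9) and induction hypothesis (P4), (P5))
it follows that (P3) is now satisfied for $j_{n+1} \le m < j_n$.

- $\forall\, 1 \le m < j_{n+1}$:
$\bigcup_{i=1}^{m} \mathcal{G}_{n+1}^i = \bigcup_{i=1}^{m} \mathcal{G}_n^i$
and thus (P4) & (P5) are satisfied.

The proof of Lemma 2.1 is now complete (using induction) by checking that
the induction hypothesis is satisfied at $n = 0$.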


2.3  Extending  the  Probability  Space 


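Let $\Omega = (\mathcal{X} \times \mathcal{U})^\infty$ be the space of all
$\mathcal{X} \times \mathcal{U}$ sequences (i.e. sequences of the type
$X_0, U_0, X_1, U_1, \ldots$). Give $(\mathcal{X} \times \mathcal{U})^\infty$ the
product $\sigma$-field
$\mathcal{F} = \sigma((\mathcal{X} \times \mathcal{U})^\infty)$, namely, the
smallest $\sigma$-field such that $X_0, U_0, X_1, U_1, \ldots$ are measurable.
There is a unique probability $\mathcal{P}_\theta$ on $(\Omega, \mathcal{F})$ such
that for all $n$ and all $x_0, \ldots, x_n$ in $\mathcal{X}$ and
$u_0, \ldots, u_n$ in $\mathcal{U}$,

$$ \mathcal{P}_\theta\{X_i = x_i, U_i = u_i, \text{ for } i = 0, 1, \ldots, n\} = p(x_0; \theta) \prod_{i=0}^{n-1} P(x_i, x_{i+1}; u_i, \theta) \times \prod_{i=0}^{n} 1\{U_i(x_0, u_0, \ldots, x_i) = u_i\}. \tag{2.10} $$

This triple $(\Omega, \mathcal{F}, \mathcal{P}_\theta)$ is the minimal underlying
probability space required for the description of the problem we address in this
paper.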

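For purposes of analysis and to capture feature (F2) it is useful to extend this
probability space, which we shall now proceed to do as follows: Let
$\mathcal{G} = \{g^1, \ldots, g^d\}$, and
$\mathcal{X}^d = \{\underline{x} = (x^{g^1}, \ldots, x^{g^d}) : x^{g^j} \in \mathcal{X}\}$.
Let $\Omega' = (\mathcal{X}^d)^\infty$ be the space of all $\mathcal{X}^d$
sequences (i.e. sequences of the type
$\underline{X}_0, \underline{X}_1, \ldots$). Give $(\mathcal{X}^d)^\infty$ the
product $\sigma$-field $\mathcal{F}' = \sigma((\mathcal{X}^d)^\infty)$, namely,
the smallest $\sigma$-field such that
$\underline{X}_0, \underline{X}_1, \ldots$ are measurable. There is a unique
probability $\mathcal{P}'_\theta$ on $(\Omega', \mathcal{F}')$ such that for all
$n$ and all $\underline{x}_0, \underline{x}_1, \ldots, \underline{x}_n$ in
$\mathcal{X}^d$,

$$ \mathcal{P}'_\theta\{\underline{X}_i = \underline{x}_i \text{ for } i = 0, 1, \ldots, n\} = p'_\theta(f(\underline{x}_0)) \prod_{j=1}^{d} \prod_{i=0}^{n-1} P(x_i^{g^j}, x_{i+1}^{g^j}; g^j(x_i^{g^j}), \theta) \tag{2.11} $$

where $f : \mathcal{X}^d \to \mathcal{X} \cup \{\Delta\}$, $\Delta$ is an
arbitrary element used to augment the state space $\mathcal{X}$ for the purposes
of analysis, and $f$ is defined as follows: For each $x \in \mathcal{X}$ left
cyclically shift $\{x^{1'}, \ldots, x^{k'}\}$ to $\{x^1, \ldots, x^k\}$ such that
$x^1 = x$. Consider $\mathcal{G}_0^i$ (from Section 2.2) constructed as before on
the ordering $\{x^1, \ldots, x^k\}$. Let $h : \mathcal{X} \to \mathcal{X}^d$ be
such that if $g^j \in \mathcal{G}_0^i$ then $h^j(x) = x^i$. Clearly, $h$ is
one-to-one, but not onto. Let $h[\mathcal{X}]$ be the range of $h$, and
$h^{-1} : h[\mathcal{X}] \to \mathcal{X}$ be the inverse of $h$ on its range
($h^{-1}$ is well-defined as $h$ is one-to-one). Finally, let
$f|_{h[\mathcal{X}]} = h^{-1}$ and $f(\underline{x}) = \Delta$ for
$\underline{x} \in \mathcal{X}^d - h[\mathcal{X}]$, and
$p'_\theta|_{\mathcal{X}} = p(\theta)$ (defined by (2.2)) and
$p'_\theta(\Delta) = 0$.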

Now on this probability space that we have constructed (note that there is no
dependence on the adaptive control scheme $\gamma$ so far) we can define the
random process $X_0^\gamma, U_0^\gamma, X_1^\gamma, U_1^\gamma, \ldots$ by using
the equivalent adaptive control scheme $\gamma'$. To start off let
$X_0^\gamma := f(\underline{X}_0)$. Now given
$X_0^\gamma, U_0^\gamma, \ldots, X_n^\gamma$ choose adaptively $g_n$ such that
$U_n^\gamma := g_n(X_n^\gamma)$ and
$X_{n+1}^\gamma := X^{g_n}_{T_n^{g_n} + 1}$, where $T_n^{g_n}$ is the number of
times the control law $g_n$ was used up to time $n$ (in
$X_0, U_0, \ldots, X_n$), and $X^{g_n}_{T_n^{g_n} + 1}$ is the component of
$\underline{X}_{T_n^{g_n} + 1}$ corresponding to $g_n$. It can be easily verified
that the random process $X_0^\gamma, U_0^\gamma, X_1^\gamma, U_1^\gamma, \ldots$
constructed above has the same distribution (in
$(\Omega', \mathcal{F}', \mathcal{P}'_\theta)$) as the one given by (2.10). Note
that for $\underline{X}_0$ such that $f(\underline{X}_0) = \Delta$ the process is
undefined, but that is not important as
$\mathcal{P}'_\theta\{\underline{X}_0 : f(\underline{X}_0) = \Delta\} = 0$.

Using $(\Omega', \mathcal{F}', \mathcal{P}'_\theta)$ and $\gamma'$ we can now
express $E_\theta J_n$ in terms of the expected number of times each stationary
control law $g$ is used and the expected one-step reward under the invariant
distribution corresponding to each $g$.


2.4  Analysis  of  the  Reward  Criterion 


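Consider

$$ J_n = \sum_{t=0}^{n-1} r(X_t, U_t) = \sum_{t=0}^{n-1} r(X_t, U_t) \sum_{g \in \mathcal{G}} 1(g_t = g) \sum_{x \in \mathcal{X}} 1(X_t = x) $$

$$ = \sum_{g \in \mathcal{G}} \sum_{x \in \mathcal{X}} \sum_{t=0}^{n-1} r(X_t, U_t)\, 1(g_t = g)\, 1(X_t = x) = \sum_{g \in \mathcal{G}} \sum_{x \in \mathcal{X}} r(x, g(x))\, N^g(x, T_n^g) $$

where

$$ N^g(x, T_n^g) = \sum_{t=0}^{T_n^g - 1} 1(X_t^g = x) = \sum_{t=0}^{n-1} 1(X_t = x, g_t = g) \tag{2.12} $$

$$ T_n^g = \sum_{t=0}^{n-1} 1(g_t = g). \tag{2.13} $$

Note that in the extended probability space
$(\Omega', \mathcal{F}', \mathcal{P}'_\theta)$, $T_n^g$ is a stopping time w.r.t.
the increasing family of $\sigma$-algebras
$\{(\bigvee_{g' \in \mathcal{G},\, g' \ne g} \mathcal{F}_\infty^{g'}) \vee \mathcal{F}_n^g\}$
where $\mathcal{F}_n^g = \sigma(X_0^g, X_1^g, \ldots, X_n^g)$ and
$\mathcal{F}_\infty^g = \bigvee_n \mathcal{F}_n^g$.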
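The decomposition of $J_n$ into visit counts is a pathwise identity, so it can be checked on any simulated trajectory. The sketch below assumes a hypothetical two-state, two-action chain and an arbitrary (here random) schedule of stationary laws; all names and numbers are illustrative.

```python
import random


def simulate(P_u, laws, schedule, x0, rng):
    """Simulate the chain; schedule[t] is the index of the law used at time t."""
    xs = [x0]
    for gi in schedule:
        u = laws[gi][xs[-1]]
        row = P_u[u][xs[-1]]
        xs.append(rng.choices(range(len(row)), weights=row)[0])
    return xs


def total_reward(xs, schedule, laws, r):
    """J_n as the plain sum of one-step rewards r(X_t, U_t)."""
    return sum(r[xs[t]][laws[gi][xs[t]]] for t, gi in enumerate(schedule))


def decomposed_reward(xs, schedule, laws, r):
    """J_n as sum over (g, x) of r(x, g(x)) times the visit count N^g(x)."""
    counts = {}
    for t, gi in enumerate(schedule):
        key = (gi, xs[t])
        counts[key] = counts.get(key, 0) + 1
    return sum(r[x][laws[gi][x]] * c for (gi, x), c in counts.items())


P_u = {0: [[0.9, 0.1], [0.5, 0.5]],
       1: [[0.2, 0.8], [0.3, 0.7]]}
r = [[1.0, 0.0], [0.0, 2.0]]
laws = [(0, 0), (0, 1), (1, 0), (1, 1)]  # all maps from 2 states to 2 actions

rng = random.Random(0)
schedule = [rng.randrange(len(laws)) for _ in range(200)]
xs = simulate(P_u, laws, schedule, 0, rng)
lhs = total_reward(xs, schedule, laws, r)
rhs = decomposed_reward(xs, schedule, laws, r)
```

The two values agree for every trajectory and every schedule, which is why the analysis can focus on the counts $N^g(x, T_n^g)$ and $T_n^g$.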
To express $E N^g(x, T_n^g)$ in terms of the invariant distribution under $g$
and $E T_n^g$ we use the following result:

Lemma 2.2 Let $X_0, X_1, \ldots$ be Markovian with finite state space
$\mathcal{X}$, transition matrix $P$ (irreducible and aperiodic) and stationary
distribution $\pi$. Let $\mathcal{F}_n$ denote the $\sigma$-algebra generated by
$X_0, X_1, \ldots, X_n$. Let $\mathcal{G}$ be another $\sigma$-algebra and $A$ an
event such that $A \in \mathcal{F}_0 \vee \mathcal{G}$ and
$\{X_0 = x\} \cap A$ equals either $A$ or $\emptyset$. Furthermore let
$\mathcal{G}$ be independent of $\mathcal{F}_\infty$, conditioned on the event
$A$. Let $\tau$ be a stopping time of $\{\mathcal{G} \vee \mathcal{F}_n\}$ such
that $E[\tau \mid A] < \infty$. Let

$$ N(x, \tau) = \sum_{t=0}^{\tau - 1} 1(X_t = x). $$

Then, for some fixed constant $K$, independent of $A$, $x$ and $\tau$,

$$ |E[N(x, \tau) \mid A] - \pi(x) E[\tau \mid A]| < K. \tag{2.14} $$

Proof: Follows from Lemma 2.1 in [11].


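Notice that $\bigvee_{g' \in \mathcal{G},\, g' \ne g} \mathcal{F}_\infty^{g'}$ and
$\mathcal{F}_\infty^g$ are independent conditioned on the event
$A_{\underline{x}} = \{\underline{X}_0 = \underline{x}\}$,
$\underline{x} \in \mathcal{X}^d$. Moreover
$A_{\underline{x}} \in \bigvee_{g \in \mathcal{G}} \mathcal{F}_0^g \subset ((\bigvee_{g' \in \mathcal{G},\, g' \ne g} \mathcal{F}_\infty^{g'}) \vee \mathcal{F}_0^g)$
and

$$ \{X_0^g = x\} \cap \{\underline{X}_0 = \underline{x}\} = \begin{cases} \{\underline{X}_0 = \underline{x}\} & \text{if } \{\underline{X}_0 = \underline{x}\} \subset \{X_0^g = x\}, \\ \emptyset & \text{otherwise.} \end{cases} $$

Therefore by Lemma 2.2 it follows that

$$ |E_\theta[N^g(x, T_n^g) \mid A_{\underline{x}}] - \pi^g(x; \theta) E_\theta[T_n^g \mid A_{\underline{x}}]| < K $$

for some fixed constant $K$ independent of $x$, $\underline{x}$ and $n$.

Thus,

$$ |E_\theta[N^g(x, T_n^g)] - \pi^g(x; \theta) E_\theta[T_n^g]| < K. \tag{2.15} $$

From (2.13) and (2.15) it follows that

$$ \Big| E_\theta J_n - \sum_{g \in \mathcal{G}} \mu^g(\theta) E_\theta T_n^g \Big| < K' \tag{2.16} $$

where $K'$ is independent of $n$ and $\mu^g(\theta)$ is as defined by (2.7). Let
$g^*(\theta) = \arg\max_{g \in \mathcal{G}} \mu^g(\theta)$, and for simplicity
assume that it is unique for each $\theta \in \Theta$. Thus if we knew the true
parameter, the control scheme $g_n = g^*(\theta)$ gives the optimal reward (up to
a constant) for all $n$, and for this scheme

$$ |E_\theta J_n - n \mu^{g^*(\theta)}(\theta)| < K'. $$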

In the absence of the knowledge of the true parameter it is desirable to approach
this performance as closely as possible. For this purpose we define the Loss
associated with an adaptive control scheme $\gamma$,

$$ L_n(\theta) := n \mu^{g^*(\theta)}(\theta) - E_\theta J_n. \tag{2.17} $$

By (2.16) it follows that

$$ \Big| L_n(\theta) - \sum_{g \in \mathcal{G}} (\mu^{g^*(\theta)}(\theta) - \mu^g(\theta)) E_\theta T_n^g \Big| < \text{const.} \tag{2.18} $$

Maximizing $E_\theta J_n$ is thus equivalent to minimizing the Loss. More
precisely we want to minimize the rate at which the Loss increases with $n$ (e.g.
finite, logarithmic, linear etc.). Thus, this is a stronger criterion for
optimality than the average reward per unit time criterion (used in [1] - [7])
which only requires the Loss to be $o(n)$. In view of (2.18) the above problem is
reduced to one of minimizing the rate at which $E_\theta T_n^g$ increases for
$g \in \mathcal{G}$, $g \ne g^*(\theta)$.
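The step from (2.16) to (2.18) uses only the fact that exactly one stationary control law is used at each time, so that the counts $T_n^g$ sum to $n$. A short worked version of that step, in the notation of (2.13), (2.16) and (2.17):

```latex
\begin{align*}
\sum_{g \in \mathcal{G}} T_n^g &= n
  \quad\Longrightarrow\quad
  n\,\mu^{g^*(\theta)}(\theta)
  = \sum_{g \in \mathcal{G}} \mu^{g^*(\theta)}(\theta)\, E_\theta T_n^g , \\
L_n(\theta) - \sum_{g \in \mathcal{G}}
  \bigl(\mu^{g^*(\theta)}(\theta) - \mu^{g}(\theta)\bigr) E_\theta T_n^g
  &= \sum_{g \in \mathcal{G}} \mu^{g}(\theta)\, E_\theta T_n^g - E_\theta J_n ,
\end{align*}
```

and the right-hand side is bounded in absolute value by the constant $K'$ of (2.16). The term with $g = g^*(\theta)$ contributes nothing to the sum in (2.18), which is why only the suboptimal laws matter.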

Note that it is impossible to minimize $L_n(\theta)$ uniformly over all
parameters $\theta \in \Theta$. For example the stationary control scheme
$g_n = g^*(\theta)$ for all $n$ will have a finite Loss when the true parameter is
$\theta$. However, when the true parameter is $\theta'$ such that
$g^*(\theta') \ne g^*(\theta)$, then this scheme will have a Loss proportional to
$n$. Having made this observation we call a scheme "uniformly good" if for every
parameter $\theta \in \Theta$

$$ L_n(\theta) = o(n^\alpha) \text{ for every } \alpha > 0. \tag{2.19} $$

Such schemes do not allow the Loss to increase very rapidly for any
$\theta \in \Theta$. We restrict our attention to the class of uniformly good
schemes and consider any others as uninteresting.

3.  A  Lower  Bound  on  the  Loss 

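In this section we obtain a lower bound on the Loss $L_n(\theta)$ for certain
values of the parameter $\theta \in \Theta$. Before we present the bound we
introduce the necessary notation. Let

$$ B(\theta) := \{\theta' \in \Theta : P^{g^*(\theta)}(\theta') = P^{g^*(\theta)}(\theta) \text{ and } g^*(\theta') \ne g^*(\theta)\}, $$

$$ \mathcal{G}_\theta := \mathcal{G} - \{g^*(\theta)\}, $$

$$ \mathcal{A}_\theta := \Big\{(\alpha^g, g \in \mathcal{G}_\theta) : \alpha^g \ge 0, \sum_{g \in \mathcal{G}_\theta} \alpha^g = 1\Big\}, $$

$$ d_\theta(g) := \mu^{g^*(\theta)}(\theta) - \mu^g(\theta) \quad \text{and} $$

$$ I^g(\theta, \theta') := \sum_{x \in \mathcal{X}} \pi^g(x; \theta) \sum_{y \in \mathcal{X}} P(x, y; g(x), \theta) \log \frac{P(x, y; g(x), \theta)}{P(x, y; g(x), \theta')}. $$

Note that $I^g(\theta, \theta')$ is just the expectation with respect to the
invariant measure of $P^g(\theta)$ of the Kullback-Leibler numbers between the
individual rows of $P^g(\theta)$ and $P^g(\theta')$ thought of as probability
distributions on $\mathcal{X}$.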
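The number $I^g(\theta, \theta')$ is straightforward to evaluate once the closed-loop matrices and the invariant distribution are available. A minimal sketch, assuming the matrices are given as nested lists and that the support condition (2.3) holds (the function name and the example matrices are illustrative):

```python
import math


def kl_rate(Pg_theta, Pg_theta_prime, pi):
    """Invariant-measure average of the row-wise Kullback-Leibler numbers.

    Pg_theta, Pg_theta_prime: closed-loop matrices P^g(theta), P^g(theta').
    pi: stationary distribution of Pg_theta.
    """
    total = 0.0
    for x, (row, row_prime) in enumerate(zip(Pg_theta, Pg_theta_prime)):
        for p, q in zip(row, row_prime):
            if p > 0:  # by (2.3), q > 0 whenever p > 0
                total += pi[x] * p * math.log(p / q)
    return total


# Hypothetical two-state example; the uniform pi is stationary for P.
P = [[0.5, 0.5], [0.5, 0.5]]
Q = [[0.25, 0.75], [0.5, 0.5]]
pi = [0.5, 0.5]
value = kl_rate(P, Q, pi)
```

In this example only the first row differs, so the result is $0.5\,(0.5 \log 2 + 0.5 \log(2/3)) = 0.25 \log(4/3)$. The number is zero exactly when the rows visited under $\pi$ coincide, which is the situation in which the law $g$ carries no information for distinguishing $\theta$ from $\theta'$.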

The  bound  is  now  presented  in  the  form  of  Theorem  3.1  below. 

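Theorem 3.1 Let $\theta \in \Theta$ be such that $B(\theta)$ is non-empty. Then
for any uniformly good control scheme $\phi$, under the parameter $\theta$,

$$ \text{(1)} \quad \lim_{n \to \infty} \mathcal{P}_\theta\Bigg\{ \sum_{g \in \mathcal{G}_\theta} T_n^g\, d_\theta(g) < \frac{(1 - \rho) \log n}{\max_{\alpha \in \mathcal{A}_\theta} \min_{\theta' \in B(\theta)} \dfrac{\sum_{g \in \mathcal{G}_\theta} \alpha^g I^g(\theta, \theta')}{\sum_{g \in \mathcal{G}_\theta} \alpha^g d_\theta(g)}} \Bigg\} = 0 \quad \forall \rho > 0. \tag{3.1} $$

Consequently,

$$ \text{(2)} \quad \liminf_{n \to \infty} \frac{L_n(\theta)}{\log n} \ge \min_{\alpha \in \mathcal{A}_\theta} \max_{\theta' \in B(\theta)} \frac{\sum_{g \in \mathcal{G}_\theta} \alpha^g d_\theta(g)}{\sum_{g \in \mathcal{G}_\theta} \alpha^g I^g(\theta, \theta')}. \tag{3.2} $$

Proof

The proof can easily be obtained from that of Theorem 3.1 of [8] by substituting
$g$ for $u$ and $\mathcal{G}_\theta$ for $\mathcal{U}_\theta$ and by invoking the
ergodic theorem instead of the strong law of large numbers. □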

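This is the set of parameters for which the optimal control laws are the same as
that for $\theta$, and the transition probabilities under the optimal control law
are also identical. Let

$$ G(S(\theta)) := \{g : P^g(\theta') \ne P^g(\theta),\ \theta' \in S(\theta)\}. \tag{4.4} $$

Recall from Section 3 that

$$ B(\theta) := \{\theta' \in \Theta : P^{g^*(\theta)}(\theta') = P^{g^*(\theta)}(\theta) \text{ and } g^*(\theta') \ne g^*(\theta)\}. \tag{4.5} $$

This is the set of parameters for which the optimal control laws are better than
the optimal control law for $\theta$, and the transition probabilities under the
optimal control law for $\theta$ are identical.

Let

$$ \alpha(\theta) = \{\alpha^g(\theta) : g \in \mathcal{G}_\theta\} \tag{4.6} $$

achieve the minimum in the lower bound for the Loss in (3.2), where
$\mathcal{G}_\theta = \mathcal{G} - \{g^*(\theta)\}$, and let

$$ \nu^g_{x_0} = E^g_\theta[\inf\{n \ge 1 \mid X_n = x_0\} \mid X_0 = x_0] \tag{4.7} $$

be the expected recurrence time of the state $x_0$ under the control law $g$. On
the basis of these define

$$ \beta(\theta) = \{\beta^g(\theta) : g \in \mathcal{G}_\theta\} \quad \text{with} \quad \beta^g(\theta) = \frac{\alpha^g(\theta) / \nu^g_{x_0}}{\sum_{g' \in \mathcal{G}_\theta} \alpha^{g'}(\theta) / \nu^{g'}_{x_0}}. \tag{4.8} $$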
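The weights in (4.8) convert a target split of time among the suboptimal laws into per-block selection probabilities, compensating for the fact that blocks under different laws have different expected lengths. The sketch below assumes the reading of (4.8) in which $\beta^g$ is proportional to $\alpha^g / \nu^g_{x_0}$; the dictionary layout and the numbers are illustrative.

```python
def block_probabilities(alpha, nu):
    """Per-block selection probabilities beta^g proportional to alpha^g / nu^g.

    alpha: target time fractions over the suboptimal laws (sums to 1).
    nu: expected recurrence time of x0 under each law.
    """
    weights = {g: alpha[g] / nu[g] for g in alpha}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}


# Hypothetical values: equal target time shares, but blocks under "b"
# are on average twice as long as blocks under "a".
alpha = {"a": 0.5, "b": 0.5}
nu = {"a": 2.0, "b": 4.0}
beta = block_probabilities(alpha, nu)

# Expected time spent per randomized block under each law.
time_share = {g: beta[g] * nu[g] for g in beta}
```

Here `beta` is (2/3, 1/3): law "a" is chosen twice as often per block, and the expected time spent under each law per block is equal, matching the target split `alpha`.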


4.2  Description  of  the  Control  Scheme 

Let $x_0 \in \mathcal{X}$ be an arbitrary but fixed state. Define the
$\{\mathcal{F}_t = \sigma(X_0, U_0, X_1, \ldots, X_{t-1}, U_{t-1}, X_t)\}$ stopping
times $\tau_0, \tau_1, \ldots$ by
$\tau_m := \inf\{t > \tau_{m-1} \mid X_t = x_0\}$, $m \ge 1$, and
$\tau_0 = \inf\{t \mid X_t = x_0\}$. The control scheme we construct chooses a
stationary control law at times $0, \tau_0, \tau_1, \ldots$ adaptively on the
basis of all the past observations and past actions, and uses this control law
till $\tau_0 - 1, \tau_1 - 1, \tau_2 - 1, \ldots$ respectively. That is, over each
recurrence interval marked by the state $x_0$ we use the same control law, which
is chosen adaptively at the beginning of that block. With this in mind we now
describe how the choice of control laws is made at the beginning of each block.
From now on we shall refer to the actual time as time and the recurrence points
as instances. Initially, i.e. at $t = 0$, choose a fixed but arbitrary control law
$g_0$ and use it till time $\tau_0 - 1$. Then to start off, use each of the
control laws $g \in \mathcal{G}$ once each. From then on, at each recurrence
point, compute the empirical pair measure
$p_n^g := \{p_n^g(x, y) \mid x, y \in \mathcal{X}\} \in M^{(2)}$ corresponding to
each $g \in \mathcal{G}$ as

$$ p_n^g(x, y) := \frac{1}{T_n^g - T_{\tau_0}^g} \sum_{t = \tau_0}^{n-1} 1\{g_t = g, X_t = x, X_{t+1} = y\} \tag{4.9} $$

where $n$ is the actual time.
Define the conditions

C1($\theta$): $p_n^g \in \epsilon$-nbd$(p_\theta^g)$ $\forall g \in \mathcal{G}$ and $B(\theta)$ is empty.
C2($\theta$): $p_n^g \in \epsilon$-nbd$(p_\theta^g)$ $\forall g \in \mathcal{G}$ and $B(\theta)$ is non-empty.

C3: there does not exist $\theta \in \Theta$ such that $p_n^g \in \epsilon$-nbd$(p_\theta^g)$ $\forall g \in \mathcal{G}$.
(Note that C3 $= (\bigcup_{\theta \in \Theta} (\text{C1}(\theta) \cup \text{C2}(\theta)))^c$.) Proceed as follows.

1) If C1($\theta$) is satisfied for some $\theta \in \Theta$ then use $g^*(\theta)$.

2) If C2($\theta$) is satisfied for some $\theta \in \Theta$ then do the following:
Maintain a count of the number of instances condition C2($\theta$) is satisfied.
Of these, for the first instance choose among those control laws
$g \in \mathcal{G}_\theta$ randomly with probabilities $\beta^g(\theta)$. Refer to
this process as "randomization". For those instances when this count is even
(call this situation C2($\theta$)a) use $g^*(\theta)$. For other instances when
the count is odd (call this situation C2($\theta$)b) compute the likelihood ratio

$$ \Lambda_n(\theta) = \min_{\theta' \in B(\theta)} \prod_{t=0}^{T_n^r - 1} \frac{P(X_t^r, X_{t+1}^r; g_t^r(X_t^r), \theta)}{P(X_t^r, X_{t+1}^r; g_t^r(X_t^r), \theta')} $$

of $\theta$ vs $B(\theta)$, where
$X_0^r, g_0^r, X_1^r, \ldots, g_{T_n^r - 1}^r, X_{T_n^r}^r$ is the sequence of
pairs of control laws used and states observed up to time $n$ when
"randomization" is done with $\beta(\theta)$. If $\Lambda_n > K_{n+1}$ (say
C2($\theta$)b1), where $K_n = n (\log n)^p$ for some fixed $p > 1$, then use
$g^*(\theta)$. If $\Lambda_n \le K_{n+1}$ (say C2($\theta$)b2) then do the
following: Maintain a count of the number of instances this condition
(C2($\theta$)b2) is satisfied. If this count is a perfect square (say
C2($\theta$)b2a) then use round robin amongst $g \in G(S(\theta))$. If this count
is not a perfect square (say C2($\theta$)b2b) then do "randomization" using
$\beta(\theta)$.

3) If C3 is satisfied then use round-robin amongst $g \in \mathcal{G}$.
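The case analysis in steps 1) to 3) can be summarized as a small dispatch function. This is only a sketch of the branching logic as described above: it assumes the caller has already determined which condition holds and maintains the two counters (both counted including the current instance); the function names and the returned labels are hypothetical.

```python
import math


def threshold(n, p):
    """K_n = n (log n)^p for a fixed p > 1."""
    return n * math.log(n) ** p


def is_perfect_square(m):
    root = math.isqrt(m)
    return root * root == m


def choose_action(condition, c2_count=0, c2b2_count=0, lam=None, n=None, p=2.0):
    """Return a label for the kind of control law to use at this instance.

    condition: "C1", "C2" or "C3" (for the estimated parameter theta).
    c2_count: number of instances so far at which C2(theta) held.
    c2b2_count: number of instances so far at which C2(theta)b2 held.
    lam: current likelihood ratio of theta vs B(theta); n: actual time.
    """
    if condition == "C1":
        return "optimal"              # use g*(theta)
    if condition == "C3":
        return "round_robin_all"      # round robin over all laws
    # condition == "C2"
    if c2_count == 1:
        return "randomize"            # first instance: randomize with beta
    if c2_count % 2 == 0:
        return "optimal"              # C2(theta)a
    if lam > threshold(n + 1, p):
        return "optimal"              # C2(theta)b1
    if is_perfect_square(c2b2_count):
        return "round_robin_GS"       # C2(theta)b2a: round robin over G(S(theta))
    return "randomize"                # C2(theta)b2b
```

For example, `choose_action("C2", c2_count=3, c2b2_count=4, lam=1.0, n=10)` returns `"round_robin_GS"`, since the ratio is below $K_{11}$ and the count 4 is a perfect square. The perfect-square rule makes the round-robin over $G(S(\theta))$ sparse, so that it is invoked only about $\sqrt{m}$ times among $m$ such instances.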




4.3  Upper  Bound  on  the  Loss 


In this section we derive an upper bound on the Loss associated with the
adaptive control scheme $\gamma^*$ constructed in Section 4.2. The bound is given
by the main Theorem 4.2. Lemmas 4.1, 4.2, 4.3 and Theorem 4.1 are needed for the
proof of the main theorem.

Lemma 4.1: Let $X_0, X_1, \ldots$ be Markovian with finite state space
$\mathcal{X}$, transition matrix $P$, invariant distribution $\pi$, and initial
distribution $p$. Let $M^{(2)}$ be the unit simplex on
$\mathbb{R}^{|\mathcal{X}|^2}$ identified with the space of probability measures
on $\mathcal{X}^2$, and let $K \subset M^{(2)}$, closed, such that
$\pi P \notin K$. Let $p_n := \{p_n(x, y) \mid x, y \in \mathcal{X}\}$ where
$p_n(x, y) := \frac{1}{n} \sum_{t=0}^{n-1} 1\{X_t = x, X_{t+1} = y\}$. Then

(i) $P(p_n \in K) \le A e^{-an}$ for all $n \ge 1$ for some positive constants $A$, $a$.

Let $N := \sum_{n=1}^{\infty} 1(p_n \in K)$. Then

(ii) $EN < \infty$.

Let $L := \sup\{n \ge 1 \mid p_n \in K\}$. Then

(iii) $EL < \infty$.

Proof:

Part (i) follows from the theory of large deviations. See [14], Problem IX.6.12.

$$ EN = \sum_{n \ge 1} P(p_n \in K) \le \sum_{n \ge 1} A e^{-an} < \infty $$

which proves (ii).

$$ EL = E \sum_{n=1}^{\infty} 1(\exists\, t \ge n,\ p_t \in K) = E \sum_{n=1}^{\infty} 1\Big(\bigcup_{t \ge n} (p_t \in K)\Big) \le \sum_{n=1}^{\infty} \sum_{t \ge n} P(p_t \in K) \le \sum_{n=1}^{\infty} \sum_{t \ge n} A e^{-at} < \infty $$

which proves (iii).
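The two series in the proof of Lemma 4.1 are geometric and can be summed in closed form, which makes the finiteness claims explicit. With $A$ and $a$ the constants of part (i):

```latex
\begin{align*}
\sum_{n \ge 1} A e^{-an} &= \frac{A e^{-a}}{1 - e^{-a}} , \\
\sum_{n \ge 1} \sum_{t \ge n} A e^{-at}
  &= \sum_{n \ge 1} \frac{A e^{-an}}{1 - e^{-a}}
   = \frac{A e^{-a}}{(1 - e^{-a})^{2}} .
\end{align*}
```

Both bounds depend only on $A$ and $a$, and they grow as $a \to 0$, that is, as the closed set $K$ gets closer to the limiting pair measure.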


Lemma 4.2: Let $S_n = X_1 + \ldots + X_n$ where $X_1, X_2, \ldots$ are i.i.d.,
$EX_1 > 0$ and let $N = \sum_{n=1}^{\infty} 1(S_n \le 0)$,
$L = \sum_{n=1}^{\infty} 1(\inf_{t \ge n} S_t \le 0)$. Then the following are
equivalent:

(a) $E(|X_1|^2\, 1(X_1 < 0)) < \infty$.

(b) $EN < \infty$.

(c) $EL < \infty$.

Proof: See Hogan [15].

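Lemma 4.3: Let $X_1, X_2, \ldots$ be i.i.d. Let $f^i$ be a real valued Borel
function such that $0 < E f^i(X_1) < \infty$, $i \in I$, finite. Let
$S_n^i = f^i(X_1) + f^i(X_2) + \ldots + f^i(X_n)$,
$L_A^i = \sum_{n=1}^{\infty} 1(\inf_{t \ge n} S_t^i \le A)$, and
$L_A = \max_{i \in I} L_A^i$. If $E(|f^i(X_1)|^2\, 1(f^i(X_1) < 0)) < \infty$ for
all $i \in I$, then

$$ \limsup_{A \to \infty} \frac{E L_A}{A} \le \frac{1}{\min_{i \in I} (E f^i(X_1))}. \tag{4.10} $$

Proof: For $\epsilon > 0$, and for any fixed $i \in I$,

$$ L_A^i \le \frac{A(1 + \epsilon)}{E f^i(X_1)} + L'^i \tag{4.11} $$

where

$$ L'^i = \sum_{n=1}^{\infty} 1\Big(\inf_{t \ge n} \Big(S_t^i - \frac{t\, E f^i(X_1)}{1 + \epsilon}\Big) \le 0\Big). $$

Consider the i.i.d. r.v.'s

$$ Z_t^i = f^i(X_t) - \frac{E f^i(X_t)}{1 + \epsilon}. $$

We have,

$$ E\{|Z_1^i|^2\, 1(Z_1^i < 0)\} \le 2 E\Big\{\Big[|f^i(X_1)|^2 + \Big(\frac{E f^i(X_1)}{1 + \epsilon}\Big)^2\Big] 1\Big(f^i(X_1) < \frac{E f^i(X_1)}{1 + \epsilon}\Big)\Big\} \tag{4.12} $$

$$ \le 2 E\{|f^i(X_1)|^2\, 1(f^i(X_1) < 0)\} + 2 E\Big\{|f^i(X_1)|^2\, 1\Big(0 \le f^i(X_1) < \frac{E f^i(X_1)}{1 + \epsilon}\Big)\Big\} + 2 \Big(\frac{E f^i(X_1)}{1 + \epsilon}\Big)^2 < \infty. $$

Then, by Lemma 4.2 it follows that $E L'^i < \infty$.

Therefore

$$ E\Big(\max_{i \in I} L'^i\Big) \le E \sum_{i \in I} L'^i = \sum_{i \in I} E L'^i = k(\epsilon) < \infty \tag{4.13} $$

for some constant $k(\epsilon)$ independent of $A$.

Now,

$$ L_A = \max_{i \in I} L_A^i \le \max_{i \in I} \Big(\frac{A(1 + \epsilon)}{E f^i(X_1)} + L'^i\Big) \le \frac{A(1 + \epsilon)}{\min_{i \in I} (E f^i(X_1))} + \max_{i \in I} L'^i. \tag{4.14} $$

By (4.11) and (4.12) it follows that

$$ E L_A \le \frac{A(1 + \epsilon)}{\min_{i \in I} (E f^i(X_1))} + k(\epsilon), $$

$$ \limsup_{A \to \infty} \frac{E L_A}{A} \le \frac{1 + \epsilon}{\min_{i \in I} (E f^i(X_1))}. $$

By letting $\epsilon \to 0$ we get the desired result.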
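The bound (4.11) is stated without justification; one way to see it, using only the definitions of $L_A^i$ and $L'^i$, is the following counting argument (here $n_0$ is shorthand introduced for this remark):

```latex
\begin{align*}
&\text{Let } n_0 = \frac{A(1+\epsilon)}{E f^i(X_1)} . \text{ If } n > n_0
  \text{ and } \inf_{t \ge n}\Bigl(S_t^i - \tfrac{t\,E f^i(X_1)}{1+\epsilon}\Bigr) > 0 ,
  \text{ then for all } t \ge n : \\
&\qquad S_t^i > \frac{t\,E f^i(X_1)}{1+\epsilon}
  \ge \frac{n\,E f^i(X_1)}{1+\epsilon} > A ,
  \quad\text{so } n \text{ is not counted in } L_A^i . \\
&\text{Hence every } n \text{ counted in } L_A^i \text{ satisfies } n \le n_0
  \text{ or is counted in } L'^i , \text{ which gives }
  L_A^i \le n_0 + L'^i .
\end{align*}
```

The first term is the deterministic time the drifted walk needs to clear level $A$ at a slightly reduced drift, and the second term counts the indices at which the walk falls behind that reduced drift.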


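Theorem 4.1 Let $\theta \in \Theta$ be such that $B(\theta)$ is non-empty. Then,

$$ \text{(1)} \quad \limsup_{n \to \infty} E_\theta \sum_{m \ge 1} 1(\Lambda_{r_m}(\theta) \le K_{n+1}) \Big/ \log n \le \frac{1}{\min_{\theta' \in B(\theta)} \sum_{g \in \mathcal{G}_\theta} \beta^g(\theta)\, \nu^g_{x_0}\, I^g(\theta, \theta')} \tag{4.15} $$

$$ \text{(2)} \quad \mathcal{P}_{\theta'}\{\Lambda_i(\theta) > K_{n+1} \text{ for some } 1 \le i \le n\} \le \frac{1}{K_{n+1}} \quad \text{for } \theta' \in B(\theta). \tag{4.16} $$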

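Proof:

Let $X_0^r, X_1^r, \ldots$ be the sequence of observed states when
"randomization" is used with $\alpha(\theta)$. Let
$\mathcal{X}^* = \bigcup_{t \ge 1} \mathcal{X}^t$, with the Borel $\sigma$-algebra
of the discrete topology, i.e. all subsets are measurable. The process
$\{X_t^r\}_{t \ge 0}$ allows us to define $\mathcal{X}^*$ valued random variables
called blocks as follows: Define the $\{\mathcal{F}_t\}$ stopping times $r_k$,
$k \ge 1$ by

$$ r_k = \inf\{t > r_{k-1} \mid X_t^r = X_0^r = x_0\} $$

with $r_0 = 0$. (Note that $r_k < \infty$ a.s.) Then the blocks $B_k$, $k \ge 1$,
are the segments of $\{X_t^r\}$ between the successive stopping times $r_{k-1}$
and $r_k$.

Let $\bar{B}_k = (B_k, g_k)$. Since the same control law is used over the entire
block, and the choice of the specific law for each block is made by independent
randomizations at the beginning of the block, it can be easily shown that
$\{\bar{B}_k\}$ are i.i.d.

Then, writing $f^{\theta'}(\bar{B}_k)$ for the log-likelihood ratio of the block
$\bar{B}_k$ under $\theta$ versus $\theta'$ and $N(x, y; r_1)$ for the number of
transitions from $x$ to $y$ in the first block,

$$ E_\theta[f^{\theta'}(\bar{B}_1) \mid X_0 = x_0] = E_\theta\Big[\log \frac{\mathcal{P}_\theta(B_1; g_1 \mid X_0 = x_0)}{\mathcal{P}_{\theta'}(B_1; g_1 \mid X_0 = x_0)} \,\Big|\, X_0 = x_0\Big] $$

$$ = \sum_{g \in \mathcal{G}_\theta} \beta^g(\theta)\, E_\theta^g\Big[\sum_{x, y \in \mathcal{X}} N(x, y; r_1) \log \frac{P(x, y; g(x), \theta)}{P(x, y; g(x), \theta')} \,\Big|\, X_0 = x_0\Big] $$

$$ = \sum_{g \in \mathcal{G}_\theta} \beta^g(\theta)\, \nu^g_{x_0} \sum_{x, y \in \mathcal{X}} \pi^g(x; \theta) P(x, y; g(x), \theta) \log \frac{P(x, y; g(x), \theta)}{P(x, y; g(x), \theta')} $$

$$ = \sum_{g \in \mathcal{G}_\theta} \beta^g(\theta)\, \nu^g_{x_0}\, I^g(\theta, \theta'). $$

Also,

$$ E_\theta[(f^{\theta'}(\bar{B}_1))^2\, 1(f^{\theta'}(\bar{B}_1) < 0) \mid X_0 = x_0] = E_\theta\Big[\Big(\log \frac{\mathcal{P}_\theta(B_1; g_1 \mid X_0 = x_0)}{\mathcal{P}_{\theta'}(B_1; g_1 \mid X_0 = x_0)}\Big)^2 1\big(\mathcal{P}_\theta(B_1; g_1 \mid X_0 = x_0) < \mathcal{P}_{\theta'}(B_1; g_1 \mid X_0 = x_0)\big) \,\Big|\, X_0 = x_0\Big] < \infty. $$

Thus by Lemma 4.3 we have the desired result (i).

To prove (ii) note that

$$ \{\Lambda_i(\theta) > K_{n+1} \text{ for some } 1 \le i \le n\} = \Big\{\min_{\theta'' \in B(\theta)} \prod_{t=0}^{i-1} \frac{P(X_t^r, X_{t+1}^r; g_t^r(X_t^r), \theta)}{P(X_t^r, X_{t+1}^r; g_t^r(X_t^r), \theta'')} > K_{n+1} \text{ for some } 1 \le i \le n\Big\} $$

$$ \subset \Big\{\prod_{t=0}^{i-1} \frac{P(X_t^r, X_{t+1}^r; g_t^r(X_t^r), \theta)}{P(X_t^r, X_{t+1}^r; g_t^r(X_t^r), \theta')} > K_{n+1} \text{ for some } 1 \le i \le n\Big\} $$

for any $\theta' \in B(\theta)$, and
$\Big(\prod_{t=0}^{i-1} \frac{P(X_t^r, X_{t+1}^r; g_t^r(X_t^r), \theta)}{P(X_t^r, X_{t+1}^r; g_t^r(X_t^r), \theta')}\Big)_{i \ge 1}$
is a martingale under $\theta'$ with mean 1.

Thus the result follows by the submartingale inequality (see [13], pg 243).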
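The martingale property used for part (ii) comes down to a one-step computation. For a fixed current state $x$ and action $u$, and using the support assumption (2.3) (applied in both directions, so that the two transition rows have the same support), the conditional expectation under $\theta'$ of one more factor of the likelihood ratio is

```latex
\begin{align*}
E_{\theta'}\!\left[
  \frac{P(x, X_{t+1}; u, \theta)}{P(x, X_{t+1}; u, \theta')}
  \;\middle|\; X_t = x,\ U_t = u \right]
&= \sum_{y :\, P(x,y;u,\theta') > 0}
   P(x, y; u, \theta')\,
   \frac{P(x, y; u, \theta)}{P(x, y; u, \theta')} \\
&= \sum_{y \in \mathcal{X}} P(x, y; u, \theta) = 1 .
\end{align*}
```

Multiplying these factors therefore gives a nonnegative martingale with mean 1 under $\theta'$, and the maximal inequality for such a martingale yields the bound $1/K_{n+1}$ in (4.16).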
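Theorem 4.2: Under the adaptive control scheme $\phi^*$, for $g \ne g^*(\theta)$,

$$ \text{(i)} \quad E_\theta T_n^g \le \Big(\frac{\alpha^g(\theta)}{\min_{\theta' \in B(\theta)} \sum_{g' \in \mathcal{G}_\theta} \alpha^{g'}(\theta) I^{g'}(\theta, \theta')} + o(1)\Big) \log n \quad \text{if } B(\theta) \text{ is non-empty,} $$

$$ E_\theta T_n^g < \infty \quad \text{if } B(\theta) \text{ is empty.} \tag{4.17} $$

Consequently

$$ \text{(ii)} \quad L_n(\theta) \le \Big(\max_{\theta' \in B(\theta)} \frac{\sum_{g \in \mathcal{G}_\theta} \alpha^g(\theta)\, d_\theta(g)}{\sum_{g \in \mathcal{G}_\theta} \alpha^g(\theta)\, I^g(\theta, \theta')} + o(1)\Big) \log n \quad \text{if } B(\theta) \text{ is non-empty,} $$

$$ L_n(\theta) < \infty \quad \text{if } B(\theta) \text{ is empty,} \tag{4.18} $$

where $\alpha(\theta) = \{\alpha^g(\theta) : g \in \mathcal{G}_\theta\}$ is defined
by (4.6).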


Proof:  As  in  Section  4.2  define  the  [ft{=  <j{Xq,  Uq,  X\, . . . ,  Xt-\,Ut-i,Xt))}  stop¬ 
ping  times  r0,  rlt . . .  by  rm  :=  inf{t  >  rm_ i\Xt  =  io}  with  r0  =  inf{n|X„  =  x0}. 
Then  rm  <  oo  a.s..  Then  for  any  n  >  0,  any  g  €  G«  we  have 


T'  =  Ei(?,=?) 

t=0 

<  YL  l(9r.  =9)(T>+1  ~  A)  +1o 

t:r,  <n 


since  the  choice  of  5’s  is  only  made  at  the  stopping  times  r,.  So 


E9T l  <  EaY  HSn  =  s)(T;+i  ~  T*)1(Ti  <  n)  +  E$t0 
.=0 

=  £  £*[£*[l($Ti  =  g)l{r,  <  n)(rt+i  -  r,)| £Tl]]  +  £tfr0 
1=0 

=  Y1  EflHdT,  =  g)l{T  <  n)£j[(r,+1  -  r,)|£Ti]]  +  Eer0 
1=0 

=  =  $)l(r.  <  n)V*o\  +  E,T0 

1=0 

=  V,r0Ei  Y  l(Sr,  =g)  +  E9T,  . 

i:r,<n 


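Let us now examine the term $\sum_{i : \tau_i < n} 1\{G_i = g\}$, where $G_i = g_{\tau_i}$.

$$
\begin{aligned}
\sum_{i : \tau_i < n} 1\{G_i = g\} &= 1 + \sum_{i \ge d : \tau_i < n} 1\{G_i = g\} \\
&= 1 + \sum_{i \ge d : \tau_i < n} 1\{G_i = g,\ C_1(\theta') \text{ is satisfied at stage } i \text{ for some } \theta' \in \Theta\} \\
&\quad + \sum_{i \ge d : \tau_i < n} 1\{G_i = g,\ C_2(\theta') \text{ is satisfied at stage } i \text{ for some } \theta' \in \Theta\} \\
&\quad + \sum_{i \ge d : \tau_i < n} 1\{G_i = g,\ C_3 \text{ is satisfied at stage } i\} \\
&= 1 + \text{Term 1} + \text{Term 2} + \text{Term 3} \quad \text{(say)},
\end{aligned}
\tag{4.19}
$$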

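where $C_1(\theta')$, $C_2(\theta')$ and $C_3$ are defined in Section 4.2 and $d$ is the cardinality of the set $\mathcal{G}$ of stationary controls. Let us now examine each term separately.
Defining $C^g$ by

$$
C^g := \sup \{ n \ge 1 : p_n(g) \notin \epsilon\text{-nbd}(\nu_\theta^g) \} \tag{4.20}
$$

and noting that $E_\theta C^g < \infty$ by Lemma 4.1(ii), we get
Term 3 $\le \sum_{g' \in \mathcal{G}} C^{g'}$, thus,

$$
E_\theta\, \text{Term 3} \le \sum_{g' \in \mathcal{G}} E_\theta C^{g'} < \infty , \tag{4.21}
$$

and Term 1 $\le C^g$, thus,

$$
E_\theta\, \text{Term 1} \le E_\theta C^g < \infty . \tag{4.22}
$$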
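To make the role of the last-exit time in (4.20) concrete, here is a small sketch under simplifying assumptions: the empirical distribution $p_n(g)$ is replaced by the running frequency of ones in a 0/1 sequence, the target $\nu_\theta^g$ by a scalar `nu`, and the supremum is taken over a finite observed horizon only. The function name `last_exit_time` is hypothetical.

```python
def last_exit_time(samples, nu, eps):
    """Largest n >= 1 (within the observed horizon) at which the running
    frequency of ones lies outside the eps-neighbourhood of nu.
    Returns 0 if the running frequency never leaves the neighbourhood."""
    total = 0
    last = 0
    for n, x in enumerate(samples, start=1):
        total += x
        if abs(total / n - nu) > eps:
            last = n
    return last


if __name__ == "__main__":
    # Running frequencies: 1, 1, 2/3, 1/2, 3/5, 1/2
    print(last_exit_time([1, 1, 0, 0, 1, 0], nu=0.5, eps=0.2))
```

Any event that can only occur while the empirical distribution is outside the neighbourhood can occur at most this many times, which is how Terms 1 and 3 are bounded.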


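$$
\begin{aligned}
\text{Term 2} &= \sum_{i \ge d : \tau_i < n} 1\{G_i = g,\ C_2(\theta') \text{ is satisfied at stage } i \text{ for some } \theta' \in \Theta \text{ such that } \nu_{\theta'}^{g^*(\theta')} \ne \nu_\theta^{g^*(\theta')}\} \\
&\quad + \sum_{i \ge d : \tau_i < n} 1\{G_i = g,\ C_2(\theta') \text{ is satisfied at stage } i \text{ for some } \theta' \in \Theta \text{ such that } \theta \in B(\theta')\} \\
&\quad + \sum_{i \ge d : \tau_i < n} 1\{G_i = g,\ C_2(\theta') \text{ is satisfied at stage } i \text{ for some } \theta' \in \Theta \text{ such that } \theta \in S(\theta')\} \\
&\quad + \sum_{i \ge d : \tau_i < n} 1\{G_i = g,\ C_2(\theta) \text{ is satisfied at stage } i\} \\
&= \text{Term 2a} + \text{Term 2b} + \text{Term 2c} + \text{Term 2d} \quad \text{(say)}.
\end{aligned}
\tag{4.23}
$$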


Next  we  upper  bound  each  of  terms  2a  -  2d  separately. 


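$$
\begin{aligned}
\text{Term 2a} &= \sum_{\substack{\theta' : B(\theta') \text{ is not empty and} \\ \nu_{\theta'}^{g^*(\theta')} \ne \nu_\theta^{g^*(\theta')}}} \ \sum_{i \ge d : \tau_i < n} 1\{G_i = g,\ C_2(\theta') \text{ is satisfied at stage } i\} \\
&\le \sum_{\substack{\theta' : B(\theta') \text{ is not empty and} \\ \nu_{\theta'}^{g^*(\theta')} \ne \nu_\theta^{g^*(\theta')}}} \Big[ 1 + \sum_{i \ge d : \tau_i < n} 1\{G_i = g^*(\theta'),\ C_2(\theta') \text{ is satisfied at stage } i\} \Big] \\
&\le \sum_{\substack{\theta' : B(\theta') \text{ is not empty and} \\ \nu_{\theta'}^{g^*(\theta')} \ne \nu_\theta^{g^*(\theta')}}} \big( C^{g^*(\theta')} + 1 \big).
\end{aligned}
\tag{4.24}
$$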


The first of the inequalities of (4.24) holds because under $C_2(\theta')$, $g^*(\theta')$ is chosen on all the even instances, and therefore on at least as many instances as any other control, minus one. The second of the inequalities of (4.24) holds because the sum on the left-hand side counts a subset of the times when $g^*(\theta')$ is used and $p_n(g^*(\theta')) \notin \epsilon\text{-nbd}(\nu_\theta^{g^*(\theta')})$, where $\theta$ is the true parameter.
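By Lemma 4.1(ii) it follows that

$$
E_\theta\, \text{Term 2a} \le \sum_{\substack{\theta' : B(\theta') \text{ is not empty and} \\ \nu_{\theta'}^{g^*(\theta')} \ne \nu_\theta^{g^*(\theta')}}} \big( 1 + E_\theta C^{g^*(\theta')} \big) < \infty . \tag{4.25}
$$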


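$$
\begin{aligned}
\text{Term 2b} &\le \sum_{\theta' : \theta \in B(\theta')} \ \sum_{i \ge d : \tau_i < n} 1\{C_2(\theta') \text{ is satisfied at stage } i\} \\
&\le \sum_{\theta' : \theta \in B(\theta')} 2 \Big[ 1 + \sum_{i \ge d : \tau_i < n} 1\{C_2(\theta')\mathrm{b} \text{ is satisfied at stage } i\} \Big] \\
&= \sum_{\theta' : \theta \in B(\theta')} 2 \Big[ 1 + \sum_{i \ge d : \tau_i < n} 1\{C_2(\theta')\mathrm{b1} \text{ is satisfied at stage } i\} + \sum_{i \ge d : \tau_i < n} 1\{C_2(\theta')\mathrm{b2} \text{ is satisfied at stage } i\} \Big] \\
&\le \sum_{\theta' : \theta \in B(\theta')} 2 \Big[ 1 + \sum_{i \ge d : \tau_i < n} 1\{\Lambda_{\tau_i}(\theta') > K_{\tau_i + 1}\} + \sum_{i \ge d : \tau_i < n} 1\{C_2(\theta')\mathrm{b2} \text{ is satisfied at stage } i\} \Big] \\
&\le \sum_{\theta' : \theta \in B(\theta')} 2 \Big[ 1 + \sum_{i = d}^{\infty} 1\{\Lambda_j(\theta') > K_i \text{ for some } j \le i - 1\} + \sum_{i \ge d : \tau_i < n} 1\{C_2(\theta')\mathrm{b2} \text{ is satisfied at stage } i\} \Big].
\end{aligned}
\tag{4.26}
$$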

The first of the inequalities of (4.26) results by removing the condition $G_i = g$. The second one results by observing that the total number of time instants at which $C_2(\theta')$ is satisfied is upper bounded by twice the number of odd instants at which $C_2(\theta')$ holds, and by noting that at the first of these we randomize and at the other odd times we call $C_2(\theta')$b. The third inequality results because $\{C_2(\theta')\mathrm{b1}$ is satisfied at stage $i\}$ implies $\{\Lambda_{\tau_i}(\theta') > K_{\tau_i + 1}\}$.


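Consider now the term $\sum_{i \ge d : \tau_i < n} 1\{C_2(\theta')\mathrm{b2}$ is satisfied at stage $i\}$.

$$
\begin{aligned}
&\sum_{i \ge d : \tau_i < n} 1\{C_2(\theta')\mathrm{b2} \text{ is satisfied at stage } i\} \\
&\quad = \sum_{i \ge d : \tau_i < n} 1\{C_2(\theta')\mathrm{b2a} \text{ is satisfied at stage } i\} + \sum_{i \ge d : \tau_i < n} 1\{C_2(\theta')\mathrm{b2b} \text{ is satisfied at stage } i\} \\
&\quad \le 1 + 2 \sum_{i \ge d : \tau_i < n} 1\{C_2(\theta')\mathrm{b2b} \text{ is satisfied at stage } i\} \\
&\quad = 1 + 2 \sum_{i \ge d : \tau_i < n} 1\{C_2(\theta')\mathrm{b2b} \text{ is satisfied at stage } i; \text{ of the number of instances that } C_2(\theta')\mathrm{b2b} \\
&\qquad\qquad \text{has been satisfied so far, the fraction of instances that } g' \text{ is chosen} \in (\beta^{g'}(\theta') - \epsilon,\ \beta^{g'}(\theta') + \epsilon)\} \\
&\qquad + 2 \sum_{i \ge d : \tau_i < n} 1\{C_2(\theta')\mathrm{b2b} \text{ is satisfied at stage } i; \text{ of the number of instances that } C_2(\theta')\mathrm{b2b} \\
&\qquad\qquad \text{has been satisfied so far, the fraction of instances that } g' \text{ is chosen} \notin (\beta^{g'}(\theta') - \epsilon,\ \beta^{g'}(\theta') + \epsilon)\} \\
&\quad \le 1 + 2 \sum_{j=1}^{\infty} 1\{p_i(g') \notin \epsilon\text{-nbd}(\nu_\theta^{g'}) \text{ for some } i \ge (\beta^{g'}(\theta') - \epsilon) j\} \\
&\qquad + 2 \sum_{j=1}^{\infty} 1\{\text{of } j \text{ instances, the fraction of instances } g' \text{ is chosen} \notin (\beta^{g'}(\theta') - \epsilon,\ \beta^{g'}(\theta') + \epsilon)\}
\end{aligned}
\tag{4.27}
$$

where $g' \in \mathcal{G}_{\theta'}$ is such that $\nu_\theta^{g'} \ne \nu_{\theta'}^{g'}$.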


The first of the inequalities of (4.27) results by observing that the number of instances when condition $C_2(\theta')$b2a is satisfied (i.e. the count of the number of instances $C_2(\theta')$b2 is satisfied is a perfect square) is upper bounded by the number of instances when condition $C_2(\theta')$b2b is satisfied plus one. Consider now changing the index of summation to the instances when randomization is done. Then the condition $C_2(\theta')$b2b, along with the condition that the fraction of instances that $g'$ is chosen lies in $(\beta^{g'}(\theta') - \epsilon,\ \beta^{g'}(\theta') + \epsilon)$ at that stage, implies that $p_i(g') \notin \epsilon\text{-nbd}(\nu_\theta^{g'})$ for some $i \ge (\beta^{g'}(\theta') - \epsilon) j$. Extending the summation to infinity, together with the above observation, establishes the last of the inequalities of (4.27).


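Thus, by Lemma 4.1(i) and (4.16) it follows that

$$
\begin{aligned}
E_\theta\, \text{Term 2b} &\le \sum_{\theta' : \theta \in B(\theta')} 2 \Big[ 1 + \sum_{t=d}^{\infty} \big( t (\log t)^p \big)^{-1} + 1 \\
&\qquad + 2 \sum_{j=1}^{\infty} \ \sum_{i \ge (\beta^{g'}(\theta') - \epsilon) j} A_1 e^{-a_1 i} + 2 \sum_{j=1}^{\infty} A_2 e^{-a_2 j} \Big] \\
&< \infty
\end{aligned}
\tag{4.28}
$$

where $A_1, a_1, A_2, a_2 > 0$ are some constants.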
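The finiteness claimed in (4.28) rests in part on the convergence of $\sum_t (t(\log t)^p)^{-1}$. The sketch below compares a long partial sum with the standard integral-test bound. It assumes $p > 1$ and a starting index $d \ge 2$ (neither value is stated in this section), and the function names are hypothetical.

```python
import math


def partial_sum(d, n_terms, p):
    """Sum of 1 / (t * (log t)^p) over t = d, ..., d + n_terms - 1."""
    return sum(1.0 / (t * math.log(t) ** p) for t in range(d, d + n_terms))


def integral_bound(d, p):
    """First term plus the integral from d to infinity of dx / (x (log x)^p),
    which equals (log d)^(1 - p) / (p - 1) for p > 1 and d >= 2."""
    first_term = 1.0 / (d * math.log(d) ** p)
    tail = math.log(d) ** (1.0 - p) / (p - 1.0)
    return first_term + tail


if __name__ == "__main__":
    d, p = 3, 2.0    # assumed values, for illustration only
    print(partial_sum(d, 100000, p), "<=", integral_bound(d, p))
```

For $p \le 1$ the same integral diverges, so the choice of the exponent in the threshold sequence matters for this step.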


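$$
\begin{aligned}
\text{Term 2c} &= \sum_{\theta' : \theta \in S(\theta')} \ \sum_{i \ge d : \tau_i < n} 1\{G_i = g,\ C_2(\theta') \text{ is satisfied at stage } i\} \\
&\le \sum_{\theta' : \theta \in S(\theta')} \Big[ 1 + \sum_{i \ge d : \tau_i < n} 1\{G_i = g,\ C_2(\theta')\mathrm{b2} \text{ is satisfied at stage } i\} \Big] \\
&\le \sum_{\theta' : \theta \in S(\theta')} \Big[ 1 + \sum_{i \ge d : \tau_i < n} 1\{C_2(\theta')\mathrm{b2} \text{ is satisfied at stage } i\} \Big] \\
&\le \sum_{\theta' : \theta \in S(\theta')} \Big[ 1 + l^2 + \sum_{j=1}^{\infty} 1\{p_j(g') \notin \epsilon\text{-nbd}(\nu_\theta^{g'})\} (2j + 1)\, l^2 \Big]
\end{aligned}
\tag{4.29}
$$

where $g' \in \mathcal{G}(S(\theta'))$ is such that $\nu_\theta^{g'} \ne \nu_{\theta'}^{g'}$, and $\#\mathcal{G}(S(\theta')) = l$.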


The first inequality of (4.29) results by noting that since $\theta \in S(\theta')$, $g \ne g^*(\theta') = g^*(\theta)$ can be chosen only when condition $C_2(\theta')$b2 is satisfied, or at the first instance when $C_2(\theta')$ is true. The second inequality results by removing the requirement $G_i = g$. The third inequality results by upper bounding the number of instances condition $C_2(\theta')$b2 is satisfied. This can be achieved as follows: First restrict attention to those instances that are perfect squares and at which the control $g'$ is used. At these instances, since $C_2(\theta')$ is satisfied, $p_n(g') \in \epsilon\text{-nbd}(\nu_{\theta'}^{g'})$; thus, by the choice of $g' \in \mathcal{G}(S(\theta'))$, $p_n(g') \notin \epsilon\text{-nbd}(\nu_\theta^{g'})$. Consider the sum of the intervals between the above instances. (Note that the length of the $j$th interval is upper bounded by $[(j + 1)^2 - j^2]\, l^2 = (2j + 1)\, l^2$.) Then the number of instances condition $C_2(\theta')$b2 is satisfied cannot exceed this sum. Finally, the inequality results by changing the summation index to all the times when $g'$ is used and upper bounding the interval following the time $p_j(g') \notin \epsilon\text{-nbd}(\nu_\theta^{g'})$ by $(2j + 1)\, l^2$.
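The interval length $(2j + 1)\, l^2$ can be reproduced with a small counting exercise. The sketch assumes a particular reading of the scheme that is not spelled out in this section: at the $m$-th perfect-square count, the control with index $m \bmod l$ is used, so that a fixed control is used at counts $(lj)^2$. The function names are hypothetical.

```python
import math


def use_counts(control_index, n_controls, max_count):
    """Counts k <= max_count that are perfect squares m*m and at which the
    control with index m % n_controls == control_index is used
    (assumed round-robin over the perfect-square instances)."""
    uses = []
    for k in range(1, max_count + 1):
        m = math.isqrt(k)
        if m * m == k and m % n_controls == control_index:
            uses.append(k)
    return uses


def gaps(uses):
    """Differences between consecutive use counts."""
    return [b - a for a, b in zip(uses, uses[1:])]


if __name__ == "__main__":
    l = 3
    uses = use_counts(0, l, 1000)
    print(uses)         # counts (l*j)^2 for j = 1, 2, ...
    print(gaps(uses))   # j-th gap equals (2*j + 1) * l^2
```

Under this reading the $j$-th gap grows only linearly in $j$, which is why multiplying it by an exponentially small probability still gives a summable series.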

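Again, by using Lemma 4.1(i) we get

$$
E_\theta\, \text{Term 2c} \le \sum_{\theta' : \theta \in S(\theta')} \Big[ 1 + l^2 + \sum_{j=1}^{\infty} A e^{-aj} \cdot (2j + 1)\, l^2 \Big] < \infty . \tag{4.30}
$$

Now if $B(\theta)$ is empty then

$$
\text{Term 2d} = 0 . \tag{4.31}
$$

Otherwise,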


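$$
\begin{aligned}
\text{Term 2d} &= \sum_{i \ge d : \tau_i < n} 1\{G_i = g,\ C_2(\theta) \text{ is satisfied at stage } i\} \\
&\le 1 + \sum_{i \ge d : \tau_i < n} 1\{G_i = g,\ C_2(\theta)\mathrm{b2} \text{ is satisfied at stage } i\} \\
&= 1 + \sum_{i \ge d : \tau_i < n} 1\{G_i = g,\ C_2(\theta)\mathrm{b2a} \text{ is satisfied at stage } i\} + \sum_{i \ge d : \tau_i < n} 1\{G_i = g,\ C_2(\theta)\mathrm{b2b} \text{ is satisfied at stage } i\} \\
&\le 2 + \sum_{i \ge d : \tau_i < n} 1\{G_i = g,\ C_2(\theta)\mathrm{b2b} \text{ is satisfied at stage } i\} + \Big( \sum_{i \ge d : \tau_i < n} 1\{C_2(\theta)\mathrm{b2b} \text{ is satisfied at stage } i\} \Big)^{1/2}.
\end{aligned}
\tag{4.32}
$$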


The first of the inequalities of (4.32) is obtained by noting that $g \ne g^*(\theta)$ can be chosen only at the first instance when $C_2(\theta)$ is satisfied (in which case randomization is done) or when $C_2(\theta)$b2 is satisfied. The last of the inequalities of (4.32) results because the number of instances condition $C_2(\theta)$b2a is satisfied is upper bounded by one plus the square root of the number of instances $C_2(\theta)$b2b is satisfied.
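Both counting facts about perfect squares, the linear one used for (4.27) and the square-root one used for (4.32), can be checked exhaustively for small counts. The sketch assumes that among the first $m$ instances, the "a" instances are exactly those whose running count is a perfect square; the function names are hypothetical.

```python
import math


def split_counts(m):
    """Among the counts 1..m, return (number of perfect squares,
    number of non-squares)."""
    squares = math.isqrt(m)
    return squares, m - squares


def check_bounds(max_m):
    """Verify squares <= non_squares + 1 and squares <= 1 + sqrt(non_squares)
    for every m in 0..max_m."""
    for m in range(max_m + 1):
        squares, non_squares = split_counts(m)
        if squares > non_squares + 1:
            return False
        if squares > 1 + math.sqrt(non_squares):
            return False
    return True


if __name__ == "__main__":
    print(split_counts(10))      # three squares (1, 4, 9) and seven non-squares
    print(check_bounds(10000))
```

The square-root bound is the sharper of the two, and it is what keeps the contribution of the perfect-square instances of lower order than $\log n$.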


To upper bound $E_\theta$ Term 2d we use (4.32), Jensen's inequality and the following fact: At each instance $i$ when condition $C_2(\theta)$b2b is satisfied, the choice of the control law $G_i \in \mathcal{G}_\theta$ is made by an independent randomization $\beta(\theta)$. Then,

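$$
\begin{aligned}
E_\theta\, \text{Term 2d} &\le 2 + \sum_{i \ge d : \tau_i < n} P_\theta\{C_2(\theta)\mathrm{b2b} \text{ is satisfied at stage } i\} \cdot \beta^g(\theta) \\
&\quad + \Big( \sum_{i \ge d : \tau_i < n} P_\theta\{C_2(\theta)\mathrm{b2b} \text{ is satisfied at stage } i\} \Big)^{1/2} \\
&\le 2 + \beta^g(\theta)\, E_\theta \big[ \sup\{1 \le i \le n \mid \Lambda_i(\theta) \le K_{n+1}\} \big] \\
&\quad + \Big( E_\theta \big[ \sup\{1 \le k \le n \mid \Lambda_k(\theta) \le K_{n+1}\} \big] \Big)^{1/2}.
\end{aligned}
\tag{4.33}
$$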
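The Jensen step moves the expectation inside the square root: for a non-negative count $N$, $E\sqrt{N} \le \sqrt{E N}$ because the square root is concave. The sketch below checks this exactly for a binomial count, which is an illustrative stand-in for the number of b2b instances and not its actual distribution; the function names are hypothetical.

```python
import math


def binomial_pmf(n, q):
    """Probability mass function of a Binomial(n, q) count."""
    return [math.comb(n, k) * q ** k * (1.0 - q) ** (n - k) for k in range(n + 1)]


def mean_sqrt_vs_sqrt_mean(n, q):
    """Return (E[sqrt(N)], sqrt(E[N])) for N ~ Binomial(n, q)."""
    pmf = binomial_pmf(n, q)
    mean = sum(k * w for k, w in enumerate(pmf))
    mean_sqrt = sum(math.sqrt(k) * w for k, w in enumerate(pmf))
    return mean_sqrt, math.sqrt(mean)


if __name__ == "__main__":
    lhs, rhs = mean_sqrt_vs_sqrt_mean(20, 0.3)
    print(lhs, "<=", rhs)
```

Since the expected count is of order $\log n$, the square-root term is of order $(\log n)^{1/2}$ and vanishes after division by $\log n$.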


Using  (4.15)  we  get 


limsup  Eg  Term  2d/logn  < 


_ W) _ 

’  Ct 


(4.34) 


Combining (4.19), (4.21), (4.22), (4.23), (4.25), (4.28), (4.30), (4.31) and (4.34) we get (4.17). (4.18) follows easily from (4.17) and (2.18).


5.  Conclusions 


In this paper we considered the problem of adaptive control of Markov chains. The optimality criterion used, namely minimizing the rate at which the Loss increases, is stronger than the average reward per unit time criterion. Multi-armed bandit problems with “Loss” as the optimality criterion are one class of stochastic adaptive control problems that has previously been analyzed. Therefore one way to proceed with our problem is to relate it to the multi-armed bandit problem, as was done in [8] for the controlled i.i.d. process problem. The translation scheme and the extended probability space are crucial in allowing us to view the adaptive control of Markov chains as a multi-armed bandit problem. The stationary control laws correspond to the “arms”, and the sequence of states observed when any particular stationary control law is used is Markovian. The formulation then resembles that of the multi-armed bandit problem in [11], part II. One very important difference between our problem and that of [11] is that the parametrization of the “arms” in our problem is not independent. This difference is reflected in the lower bound on the Loss we obtain in Section 3, and also needs to be kept in mind when designing an optimal scheme like the one of Section 4. The control scheme presented in Section 4 has an intuitively appealing structure, as it clearly specifies the conditions under which there is either only identification, or only control, or identification and control, and treats each one of these conditions optimally.